Abstract


Recently, Sutton et al. (2015) introduced the emphatic temporal differences (ETD)
algorithm for off-policy evaluation in Markov decision processes. In this short
note, we show that the projected fixed-point equation that underlies ETD involves
a contraction operator, with a $\sqrt{\gamma}$-contraction modulus (where $\gamma$ is the discount
factor). This allows us to provide error bounds on the approximation error of ETD.
To our knowledge, these are the first error bounds for an off-policy evaluation
algorithm under general target and behavior policies.


1 Introduction 

In Reinforcement Learning (RL; Sutton & Barto, 1998), policy-evaluation refers to the problem of
evaluating the value function, a mapping from states to their long-term discounted return under a
given policy, using sampled observations of the system dynamics and reward. Policy-evaluation is
important both for assessing the quality of a policy and as a sub-procedure for policy optimization
(Sutton & Barto, 1998).

For systems with large or continuous state-spaces, an exact computation of the value function is often
impossible. Instead, an approximate value-function is sought using various function-approximation
techniques (Sutton & Barto, 1998; a.k.a. approximate dynamic-programming; Bertsekas, 2012). In
this approach, the parameters of the value-function approximation are tuned using machine-learning
inspired methods, often based on the temporal-difference idea (TD; Sutton & Barto, 1998).

The method generating the sampled data leads to two different types of policy evaluation. In
the on-policy case, the samples are generated by the target-policy, i.e., the policy under evaluation,
while in the off-policy setting, a different behavior-policy generates the data. In the on-policy
setting, TD methods are well understood, with classic convergence guarantees and
approximation-error bounds, based on a contraction property of the projected Bellman operator
underlying TD (Bertsekas & Tsitsiklis, 1996). For the off-policy case, however, standard TD methods
no longer maintain this contraction property, the error bounds do not hold, and these methods may
even diverge (Baird, 1995).

Recently, Sutton et al. (2015) proposed the emphatic TD (ETD) algorithm: a modification of the TD
idea that can be shown to converge off-policy (Yu, 2015). In this paper, we show that the projected
Bellman operator underlying ETD also possesses a contraction property, which allows us to derive
approximation-error bounds for ETD.

In recent years, several different off-policy policy-evaluation algorithms have been proposed and
analyzed, such as importance-sampling based least-squares TD (Yu, 2012), gradient-based TD
(Sutton et al., 2009), and ETD (Sutton et al., 2015). While these algorithms were shown to converge,
to our knowledge there are no guarantees on the error of the converged solution. The only exception
that we are aware of is a contraction-based argument for importance-sampling based LSTD, under
the restrictive assumption that the behavior and target policies are very similar (Bertsekas & Yu,
2009). This paper presents the first approximation-error bounds for off-policy policy evaluation
under general target and behavior policies.


2 Preliminaries 


We consider an MDP $M = (S, A, P, R, \gamma, \rho)$, where $S$ is the state space, $A$ is the action space, $P$ is
the transition probability matrix, $R$ is the reward function, $\gamma \in [0,1)$ is the discount factor, and $\rho$ is
the initial state distribution.


Given a target policy $\pi$, our goal is to evaluate the value function:

$$V^\pi(s) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0 = s\right].$$


Temporal difference methods (Sutton & Barto, 1998) approximate the value function by

$$V^\pi(s) \approx \theta^\top \phi(s),$$

where $\phi(s) \in \mathbb{R}^n$ are state features and $\theta \in \mathbb{R}^n$ are weights, and use sampling to find a suitable
$\theta$. Let $\mu$ denote a behavior policy that generates the samples $s_0, a_0, s_1, a_1, \dots$ according to
$a_t \sim \mu(\cdot|s_t)$ and $s_{t+1} \sim P(\cdot|s_t, a_t)$. We denote by $\rho_t$ the ratio $\pi(a_t|s_t)/\mu(a_t|s_t)$, and we assume,
similarly to Sutton et al. (2015), that $\mu$ and $\pi$ are such that $\rho_t$ is well-defined for all $t$.

Let $T^\pi$ denote the Bellman operator for policy $\pi$, given by

$$T^\pi V = R_\pi + \gamma P_\pi V,$$

where $R_\pi$ and $P_\pi$ are the reward vector and transition matrix induced by policy $\pi$, and let $\Phi$ denote
a matrix whose columns are the feature vectors for all states. Let $d_\mu$ and $d_\pi$ denote the stationary
distributions over states induced by the policies $\mu$ and $\pi$, respectively. For some $d \in \mathbb{R}^{|S|}$ satisfying
$d > 0$ element-wise, we denote by $\Pi_d$ a projection to the subspace spanned by $\phi(s)$ with respect to
the $d$-weighted Euclidean-norm.
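To make the Bellman operator concrete, the following is a minimal Python sketch of exact policy evaluation in the small tabular case, where $R_\pi$ and $P_\pi$ are known and $T^\pi$ can simply be iterated to its fixed point $V^\pi$. The function names and the two-state chain are hypothetical illustrations and do not appear in the original note.

```python
def bellman_operator(values, rewards, transitions, gamma):
    """Apply T^pi V = R_pi + gamma * P_pi V on a finite state space."""
    n = len(values)
    return [
        rewards[s] + gamma * sum(transitions[s][t] * values[t] for t in range(n))
        for s in range(n)
    ]


def evaluate_policy(rewards, transitions, gamma, tol=1e-12, max_iter=100_000):
    """Iterate T^pi from V = 0 until successive iterates differ by less than tol."""
    values = [0.0] * len(rewards)
    for _ in range(max_iter):
        new_values = bellman_operator(values, rewards, transitions, gamma)
        if max(abs(a - b) for a, b in zip(new_values, values)) < tol:
            return new_values
        values = new_values
    return values


# Hypothetical two-state chain (not from the paper): reward 1 in state 0 only,
# and the target policy moves to either state with probability 1/2.
R_pi = [1.0, 0.0]
P_pi = [[0.5, 0.5], [0.5, 0.5]]
V_pi = evaluate_policy(R_pi, P_pi, gamma=0.9)
print(V_pi)  # approximately [5.5, 4.5]
```

This exact computation is what becomes infeasible for large state spaces, which is the reason the value function is approximated as $\theta^\top \phi(s)$.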

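Similarly to Sutton et al. (2015), we divide the analysis into the 'pure bootstrapping' case $\lambda = 0$, and
the more general case with $\lambda \in [0,1)$. The ETD(0) algorithm iteratively updates the weight vector
$\theta$ according to:

$$
\begin{aligned}
\theta_{t+1} &:= \theta_t + \alpha F_t \rho_t \left(R_{t+1} + \gamma \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t\right) \phi_t, \\
F_t &= \gamma \rho_{t-1} F_{t-1} + 1, \qquad F_0 = 1.
\end{aligned}
$$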
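The ETD(0) recursion above can be sketched as a single-transition update. This is an illustrative Python sketch only: the function name, argument names and numerical values are hypothetical, and passing `F_prev=0.0` on the first call is our convention for obtaining $F_0 = 1$.

```python
def etd0_step(theta, phi, phi_next, reward, rho, rho_prev, F_prev, alpha, gamma):
    """One ETD(0) update from a single transition.

    Returns the updated weight vector and the follow-on trace F_t.
    """
    # Follow-on trace: F_t = gamma * rho_{t-1} * F_{t-1} + 1.
    F = gamma * rho_prev * F_prev + 1.0
    # TD error: R_{t+1} + gamma * theta^T phi_{t+1} - theta^T phi_t.
    v = sum(w * x for w, x in zip(theta, phi))
    v_next = sum(w * x for w, x in zip(theta, phi_next))
    delta = reward + gamma * v_next - v
    # Emphasis-weighted, importance-corrected TD update along phi_t.
    step = alpha * F * rho * delta
    new_theta = [w + step * x for w, x in zip(theta, phi)]
    return new_theta, F


# Hypothetical first transition of a trajectory with two features.
theta, F = etd0_step(theta=[0.0, 0.0], phi=[1.0, 0.0], phi_next=[0.0, 1.0],
                     reward=1.0, rho=2.0, rho_prev=1.0, F_prev=0.0,
                     alpha=0.1, gamma=0.9)
print(theta, F)  # [0.2, 0.0] 1.0
```

On subsequent steps the caller passes the returned `F` and the previous importance ratio back in as `F_prev` and `rho_prev`.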


The emphatic weight vector $f$ is defined by

$$f^\top = d_\mu^\top (I - \gamma P_\pi)^{-1}. \tag{1}$$
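For a small tabular problem, Equation (1) can be evaluated without an explicit matrix inverse by iterating the equivalent fixed-point form $f^\top = d_\mu^\top + \gamma f^\top P_\pi$. The sketch below is an illustration under that assumption; the function name and the two-state numbers are hypothetical.

```python
def emphatic_weights(d_mu, P_pi, gamma, tol=1e-12, max_iter=100_000):
    """Solve f^T = d_mu^T (I - gamma * P_pi)^{-1} by fixed-point iteration."""
    n = len(d_mu)
    f = list(d_mu)
    for _ in range(max_iter):
        # f(s') <- d_mu(s') + gamma * sum_s f(s) * P_pi(s' | s)
        new_f = [
            d_mu[t] + gamma * sum(f[s] * P_pi[s][t] for s in range(n))
            for t in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_f, f)) < tol:
            return new_f
        f = new_f
    return f


# Hypothetical two-state problem: d_mu is the behavior policy's stationary
# distribution and P_pi the target policy's transition matrix.
d_mu = [0.75, 0.25]
P_pi = [[0.25, 0.75], [0.25, 0.75]]
f = emphatic_weights(d_mu, P_pi, gamma=0.5)
print(f)  # approximately [1.0, 1.0]
```

Because $P_\pi$ is row-stochastic, the entries of $f$ always sum to $1/(1-\gamma)$, which gives a quick sanity check on any implementation.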

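The ETD($\lambda$) algorithm iteratively updates the weight vector $\theta$ according to

$$
\begin{aligned}
\theta_{t+1} &:= \theta_t + \alpha \left(R_{t+1} + \gamma \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t\right) e_t, \\
e_t &= \rho_t \left(\gamma \lambda e_{t-1} + M_t \phi_t\right), \qquad e_{-1} = 0, \\
M_t &= \lambda\, i(S_t) + (1 - \lambda) F_t, \\
F_t &= \rho_{t-1} \gamma F_{t-1} + i(S_t), \qquad F_0 = i(S_0),
\end{aligned}
$$

where $i : S \to \mathbb{R}_+$ is a known given function signifying the importance of the state. Note that
Sutton et al. (2015) consider a state-dependent discount factor $\gamma(s)$ and bootstrapping parameter $\lambda(s)$,
while in this paper we consider the special case where $\gamma$ and $\lambda$ are constant.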
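The ETD($\lambda$) recursions can be sketched in the same single-transition style. Again this is only an illustration with hypothetical names and values; the first call uses `F_prev=0.0` and a zero trace so that $F_0 = i(S_0)$ and $e_{-1} = 0$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def etd_lambda_step(theta, e_prev, F_prev, rho_prev, rho, phi, phi_next,
                    reward, interest, alpha, gamma, lam):
    """One ETD(lambda) update; returns (new_theta, e_t, F_t)."""
    # F_t = rho_{t-1} * gamma * F_{t-1} + i(S_t)
    F = rho_prev * gamma * F_prev + interest
    # M_t = lam * i(S_t) + (1 - lam) * F_t
    M = lam * interest + (1.0 - lam) * F
    # e_t = rho_t * (gamma * lam * e_{t-1} + M_t * phi_t)
    e = [rho * (gamma * lam * ep + M * x) for ep, x in zip(e_prev, phi)]
    # TD error and weight update along the eligibility trace.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    new_theta = [w + alpha * delta * et for w, et in zip(theta, e)]
    return new_theta, e, F


# Hypothetical first transition with interest i(S_0) = 1.
theta, e, F = etd_lambda_step(theta=[0.0, 0.0], e_prev=[0.0, 0.0], F_prev=0.0,
                              rho_prev=1.0, rho=2.0, phi=[1.0, 0.0],
                              phi_next=[0.0, 1.0], reward=1.0, interest=1.0,
                              alpha=0.1, gamma=0.9, lam=0.5)
print(theta, e, F)  # [0.2, 0.0] [2.0, 0.0] 1.0
```

With `lam=0.0` and `interest=1.0` the trace reduces to $e_t = \rho_t F_t \phi_t$, so the step coincides with the ETD(0) update.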

The emphatic weight vector $m$ is defined by

$$m^\top = \mathbf{i}^\top (I - P_\pi^\lambda)^{-1}, \tag{2}$$

where:

$$\mathbf{i}(s) = i(s) \cdot d_\mu(s), \qquad P_\pi^\lambda = I - (I - \gamma\lambda P_\pi)^{-1}(I - \gamma P_\pi). \tag{3}$$
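As a side calculation (not part of the original derivation), expanding the inverse in Equation (3) as a Neumann series, which is valid because $\gamma\lambda < 1$ and $P_\pi$ is a stochastic matrix, gives a form of $P_\pi^\lambda$ that may be easier to work with:

$$
\begin{aligned}
(I - \gamma\lambda P_\pi)^{-1}(I - \gamma P_\pi) &= (I - \gamma\lambda P_\pi)^{-1}\left[(I - \gamma\lambda P_\pi) - \gamma(1-\lambda) P_\pi\right] \\
&= I - \gamma(1-\lambda)(I - \gamma\lambda P_\pi)^{-1} P_\pi, \\
P_\pi^\lambda &= \gamma(1-\lambda) \sum_{k=0}^{\infty} (\gamma\lambda)^k P_\pi^{k+1}.
\end{aligned}
$$

In this form it is apparent that every entry of $P_\pi^\lambda$ is non-negative and that each row sums to $\gamma(1-\lambda)/(1-\gamma\lambda)$. For $\lambda = 0$ the series collapses to $P_\pi^0 = \gamma P_\pi$, so with $i(s) = 1$ for all $s$, Equation (2) reduces to Equation (1).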


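Notice that in the case of general $\lambda$, the Bellman operator is:

$$T^{(\lambda)} v = (I - \gamma\lambda P_\pi)^{-1} R_\pi + P_\pi^\lambda v.$$

Mahmood et al. (2015) show that ETD converges to some $\theta^*$ that is a solution of the projected
fixed-point equation:

$$\Phi^\top \theta^* = \Pi_m T^{(\lambda)} (\Phi^\top \theta^*).$$

In this paper, we establish that the projected Bellman operator $\Pi_m T^{(\lambda)}$ is a contraction, which allows
us to bound the error $\|\Phi^\top \theta^* - V^\pi\|_m$.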

3 Results 

We start from ETD(0). It is well known that $T^\pi$ is a $\gamma$-contraction with respect to the $d_\pi$-weighted
Euclidean norm (Bertsekas & Tsitsiklis, 1996). However, it is not immediate that the composition
$\Pi_f T^\pi$ is a contraction in any norm. Indeed, for the TD(0) algorithm (Sutton & Barto, 1998), a
similar representation as a projected Bellman operator holds, but it may be shown that in the
off-policy setting the algorithm diverges (Baird, 1995).

The following theorem shows that for ETD(0), the projected Bellman operator $\Pi_f T^\pi$ is indeed a
contraction.

Theorem 1. Denote $\kappa = \min_s \frac{d_\mu(s)}{f(s)}$. Then $\Pi_f T^\pi$ is a $\sqrt{\gamma(1-\kappa)}$-contraction with respect to the
Euclidean $f$-weighted norm, namely,

$$\|\Pi_f T^\pi v_1 - \Pi_f T^\pi v_2\|_f \le \sqrt{\gamma(1-\kappa)}\, \|v_1 - v_2\|_f, \qquad \forall v_1, v_2 \in \mathbb{R}^{|S|}.$$

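Proof. Let $F = \mathrm{diag}(f)$. We have

$$
\begin{aligned}
\|v\|_f^2 - \gamma \|P_\pi v\|_f^2 &= v^\top F v - \gamma v^\top P_\pi^\top F P_\pi v \\
&\overset{(a)}{\ge} v^\top F v - \gamma v^\top \mathrm{diag}(f^\top P_\pi) v \\
&= v^\top \left[F - \gamma\, \mathrm{diag}(f^\top P_\pi)\right] v \\
&= v^\top \left[\mathrm{diag}\left(f^\top (I - \gamma P_\pi)\right)\right] v \\
&\overset{(b)}{=} v^\top \mathrm{diag}(d_\mu) v = \|v\|_{d_\mu}^2,
\end{aligned} \tag{4}
$$

where (a) follows from the Jensen inequality:

$$
\begin{aligned}
v^\top P_\pi^\top F P_\pi v &= \sum_s f(s) \Big(\sum_{s'} P_\pi(s'|s) v(s')\Big)^2 \\
&\le \sum_s f(s) \sum_{s'} P_\pi(s'|s) v^2(s') \\
&= \sum_{s'} v^2(s') \sum_s f(s) P_\pi(s'|s) \\
&= v^\top \mathrm{diag}(f^\top P_\pi) v,
\end{aligned} \tag{5}
$$

and (b) is by the definition of $f$ in (1).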
Notice that for every $v$:

$$\|v\|_{d_\mu}^2 = \sum_s d_\mu(s) v^2(s) \ge \sum_s \kappa f(s) v^2(s) = \kappa \|v\|_f^2. \tag{6}$$


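Therefore:

$$\|v\|_f^2 \ge \gamma \|P_\pi v\|_f^2 + \|v\|_{d_\mu}^2 \ge \gamma \|P_\pi v\|_f^2 + \kappa \|v\|_f^2, \tag{7}$$

that is, $\gamma \|P_\pi v\|_f^2 \le (1-\kappa) \|v\|_f^2$, and:

$$
\begin{aligned}
\|T^\pi v_1 - T^\pi v_2\|_f^2 &= \|\gamma P_\pi (v_1 - v_2)\|_f^2 \\
&= \gamma^2 \|P_\pi (v_1 - v_2)\|_f^2 \\
&\le \gamma (1 - \kappa) \|v_1 - v_2\|_f^2.
\end{aligned} \tag{8}
$$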

Hence, $T^\pi$ is a $\sqrt{\gamma(1-\kappa)}$-contraction. Since $\Pi_f$ is a non-expansion in the $f$-weighted norm
(Bertsekas & Tsitsiklis, 1996), $\Pi_f T^\pi$ is a $\sqrt{\gamma(1-\kappa)}$-contraction as well. □

Notice that $\kappa$ obtains values ranging from $\kappa = 0$ (when there is a state visited by the target policy,
but not the behavior policy), to $\kappa = 1 - \gamma$ (when the two policies are identical). In the latter case we
obtain the classical bound: $\sqrt{\gamma(1-\kappa)} = \gamma$. This result resembles that of Kolter (2011), who used
the discrepancy between the behavior and the target policy to bound the TD-error.
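The contraction property of Theorem 1 can be probed numerically on a small problem. The sketch below is an illustration only: the three-state `d_mu` and `P_pi` are hypothetical numbers, and since $T^\pi v_1 - T^\pi v_2 = \gamma P_\pi (v_1 - v_2)$, it is enough to sample vectors $v$ and compare $\|\gamma P_\pi v\|_f / \|v\|_f$ with $\sqrt{\gamma(1-\kappa)}$.

```python
import math
import random


def emphatic_weights(d_mu, P_pi, gamma, sweeps=2000):
    """Fixed-point iteration for f^T = d_mu^T + gamma * f^T P_pi."""
    n = len(d_mu)
    f = list(d_mu)
    for _ in range(sweeps):
        f = [d_mu[t] + gamma * sum(f[s] * P_pi[s][t] for s in range(n))
             for t in range(n)]
    return f


def weighted_norm(v, w):
    """Weighted Euclidean norm sqrt(sum_s w(s) * v(s)^2)."""
    return math.sqrt(sum(wi * vi * vi for wi, vi in zip(w, v)))


def matvec(P, v):
    return [sum(p * x for p, x in zip(row, v)) for row in P]


gamma = 0.9
# Hypothetical 3-state problem: d_mu stands in for the behavior policy's
# stationary distribution, P_pi for the target policy's transition matrix.
d_mu = [0.6, 0.3, 0.1]
P_pi = [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4], [0.2, 0.1, 0.7]]

f = emphatic_weights(d_mu, P_pi, gamma)
kappa = min(d / w for d, w in zip(d_mu, f))
modulus = math.sqrt(gamma * (1.0 - kappa))

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    v = [rng.uniform(-1.0, 1.0) for _ in d_mu]
    ratio = gamma * weighted_norm(matvec(P_pi, v), f) / weighted_norm(v, f)
    worst_ratio = max(worst_ratio, ratio)

print(kappa, modulus, worst_ratio)
```

Random sampling only gives a lower estimate of the true operator norm, so this is a sanity check of the bound rather than a measurement of its tightness.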

An immediate consequence of Theorem 1 is the following error bound, based on Lemma 6.9 of
Bertsekas & Tsitsiklis (1996).

Corollary 1. We have

$$\|\Phi^\top \theta^* - V^\pi\|_f \le \frac{1}{1 - \sqrt{\gamma(1-\kappa)}} \|\Pi_f V^\pi - V^\pi\|_f.$$

In a sense, the error $\|\Pi_f V^\pi - V^\pi\|_f$ is the best approximation we can hope for, within the capability
of our linear approximation architecture. Corollary 1 guarantees that we are not too far away from
it.
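To get a feel for the size of this bound, the factor $1/(1-\sqrt{\gamma(1-\kappa)})$ multiplying the best-approximation error can be evaluated at the two extremes of $\kappa$. The helper below is a hypothetical illustration, not part of the original analysis.

```python
import math


def etd0_error_factor(gamma, kappa):
    """Return 1 / (1 - sqrt(gamma * (1 - kappa))), the factor in Corollary 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1]")
    return 1.0 / (1.0 - math.sqrt(gamma * (1.0 - kappa)))


gamma = 0.9
on_policy = etd0_error_factor(gamma, kappa=1.0 - gamma)  # identical policies
worst_case = etd0_error_factor(gamma, kappa=0.0)         # kappa at its minimum
print(on_policy, worst_case)
```

For $\gamma = 0.9$ the factor is about 10 when the policies coincide and about 19.5 when $\kappa = 0$, so in this instance the off-policy penalty is roughly a factor of two.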

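Now we move on to the analysis of ETD($\lambda$):

Theorem 2. $\Pi_m T^{(\lambda)}$ is a $\sqrt{\beta}$-contraction with respect to the Euclidean $m$-weighted norm, where
$\beta = \frac{\gamma(1-\lambda)}{1-\gamma\lambda}$. Namely,

$$\|\Pi_m T^{(\lambda)} v_1 - \Pi_m T^{(\lambda)} v_2\|_m \le \sqrt{\beta}\, \|v_1 - v_2\|_m, \qquad \forall v_1, v_2 \in \mathbb{R}^{|S|}.$$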

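Proof. The proof is almost identical to the proof of Theorem 1, only now we cannot apply Jensen's
inequality directly, since the rows of $P_\pi^\lambda$ do not sum to 1. However:

$$P_\pi^\lambda \mathbf{1} = \left(I - (I - \gamma\lambda P_\pi)^{-1}(I - \gamma P_\pi)\right) \mathbf{1} = \beta \mathbf{1}, \tag{9}$$

and each entry of $P_\pi^\lambda$ is non-negative. Therefore Jensen's inequality applies to $P_\pi^\lambda / \beta$, whose rows
sum to 1. Let $M = \mathrm{diag}(m)$; we have

$$
\begin{aligned}
\|v\|_m^2 - \frac{1}{\beta} \|P_\pi^\lambda v\|_m^2 &= v^\top M v - \frac{1}{\beta} v^\top (P_\pi^\lambda)^\top M P_\pi^\lambda v \\
&\overset{(a)}{\ge} v^\top M v - \beta\, v^\top \mathrm{diag}\Big(m^\top \frac{P_\pi^\lambda}{\beta}\Big) v \\
&= v^\top \left[M - \mathrm{diag}(m^\top P_\pi^\lambda)\right] v \\
&= v^\top \left[\mathrm{diag}\left(m^\top (I - P_\pi^\lambda)\right)\right] v \\
&\overset{(b)}{=} v^\top \mathrm{diag}(\mathbf{i}) v = \|v\|_{\mathbf{i}}^2,
\end{aligned} \tag{10}
$$

where (a) follows from the Jensen inequality and (b) from Equation (2).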

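Therefore:

$$\|v\|_m^2 \ge \frac{1}{\beta} \|P_\pi^\lambda v\|_m^2 + \|v\|_{\mathbf{i}}^2 \ge \frac{1}{\beta} \|P_\pi^\lambda v\|_m^2, \tag{11}$$

and:

$$\|T^{(\lambda)} v_1 - T^{(\lambda)} v_2\|_m^2 = \|P_\pi^\lambda (v_1 - v_2)\|_m^2 \le \beta \|v_1 - v_2\|_m^2. \tag{12}$$

Hence, $T^{(\lambda)}$ is a $\sqrt{\beta}$-contraction. Since $\Pi_m$ is a non-expansion in the $m$-weighted norm
(Bertsekas & Tsitsiklis, 1996), $\Pi_m T^{(\lambda)}$ is a $\sqrt{\beta}$-contraction as well. □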
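A numerical sanity check of Theorem 2 can be sketched along the same lines as for ETD(0). The code below is illustrative only: it builds $P_\pi^\lambda$ from the truncated series $\gamma(1-\lambda)\sum_k (\gamma\lambda)^k P_\pi^{k+1}$ (an equivalent form of Equation (3)), uses hypothetical three-state numbers, and assumes interest $i(s) = 1$ so that $\mathbf{i} = d_\mu$.

```python
import math
import random


def matmul(A, B):
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def p_lambda(P, gamma, lam, terms=300):
    """Truncated series gamma*(1-lam) * sum_k (gamma*lam)^k * P^(k+1)."""
    n = len(P)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in P]   # holds P^(k+1), starting from k = 0
    coeff = gamma * (1.0 - lam)     # holds gamma*(1-lam)*(gamma*lam)^k
    for _ in range(terms):
        for r in range(n):
            for c in range(n):
                total[r][c] += coeff * power[r][c]
        power = matmul(power, P)
        coeff *= gamma * lam
    return total


def emphatic_weights(i_bar, P_lam, sweeps=2000):
    """Fixed-point iteration for m^T = i_bar^T + m^T P_lam."""
    n = len(i_bar)
    m = list(i_bar)
    for _ in range(sweeps):
        m = [i_bar[t] + sum(m[s] * P_lam[s][t] for s in range(n))
             for t in range(n)]
    return m


def weighted_norm(v, w):
    return math.sqrt(sum(wi * vi * vi for wi, vi in zip(w, v)))


gamma, lam = 0.9, 0.5
beta = gamma * (1.0 - lam) / (1.0 - gamma * lam)

# Hypothetical 3-state problem with interest i(s) = 1, so i_bar = d_mu.
d_mu = [0.6, 0.3, 0.1]
P_pi = [[0.1, 0.2, 0.7], [0.3, 0.3, 0.4], [0.2, 0.1, 0.7]]

P_lam = p_lambda(P_pi, gamma, lam)
m = emphatic_weights(d_mu, P_lam)

rng = random.Random(0)
worst_ratio = 0.0
for _ in range(1000):
    v = [rng.uniform(-1.0, 1.0) for _ in d_mu]
    Pv = [sum(p * x for p, x in zip(row, v)) for row in P_lam]
    worst_ratio = max(worst_ratio, weighted_norm(Pv, m) / weighted_norm(v, m))

print(beta, math.sqrt(beta), worst_ratio)
```

The row sums of `P_lam` should all equal `beta`, and every sampled ratio should stay below `sqrt(beta)`.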
As before, Theorem 2 leads to the following error bound, based on Theorem 1 of
Tsitsiklis & Van Roy (1997).

Corollary 2. We have

$$\|\Phi^\top \theta^* - V^\pi\|_m \le \frac{1}{1 - \sqrt{\beta}} \|\Pi_m V^\pi - V^\pi\|_m.$$


We now show in an example that our contraction modulus bounds are tight. 


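Example. Consider an MDP with two states: Left and Right. In each state there are two identical
actions leading to either Left or Right deterministically. The behavior policy chooses Right with
probability $\epsilon$, and the target policy chooses Left with probability $\epsilon$. Calculating the quantities of
interest:

$$P_\pi = \begin{pmatrix} \epsilon & 1-\epsilon \\ \epsilon & 1-\epsilon \end{pmatrix}, \qquad d_\mu = (1-\epsilon,\ \epsilon)^\top,$$

$$f = \frac{1}{1-\gamma} \left(1 + 2\epsilon\gamma - \epsilon - \gamma,\ -2\epsilon\gamma + \epsilon + \gamma\right)^\top.$$

So for $v = (0,1)^\top$:

$$\|v\|_f^2 = \frac{\epsilon + \gamma - 2\epsilon\gamma}{1-\gamma}, \qquad \|P_\pi v\|_f^2 = \frac{(1-\epsilon)^2}{1-\gamma},$$

and for small $\epsilon$ we obtain that

$$\frac{\|\gamma P_\pi v\|_f^2}{\|v\|_f^2} \approx \gamma.$$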
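The closed-form quantities of this example are easy to tabulate. The sketch below (function and variable names are ours) compares the squared ratio attained by $v = (0,1)^\top$ with the squared modulus $\gamma(1-\kappa)$ from Theorem 1 for decreasing $\epsilon$.

```python
def example_quantities(eps, gamma):
    """Closed-form quantities for the two-state example with v = (0, 1)."""
    d_mu = (1.0 - eps, eps)
    f = ((1.0 + 2.0 * eps * gamma - eps - gamma) / (1.0 - gamma),
         (eps + gamma - 2.0 * eps * gamma) / (1.0 - gamma))
    kappa = min(d / w for d, w in zip(d_mu, f))
    norm_v_sq = f[1]                                # ||v||_f^2 = f(Right)
    norm_Pv_sq = (1.0 - eps) ** 2 * (f[0] + f[1])   # P_pi v = (1 - eps) * (1, 1)
    achieved = gamma ** 2 * norm_Pv_sq / norm_v_sq  # ||gamma P_pi v||_f^2 / ||v||_f^2
    bound = gamma * (1.0 - kappa)                   # squared modulus from Theorem 1
    return achieved, bound


for eps in (0.1, 0.01, 0.001):
    achieved, bound = example_quantities(eps, gamma=0.9)
    print(eps, achieved, bound)
```

As $\epsilon$ shrinks, both columns approach $\gamma$, which is the sense in which the modulus cannot be improved in general.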


4 Discussion 

Interestingly, the ETD error bounds in Corollaries 1 and 2 are more conservative by a factor
of square root than the error bounds for standard on-policy TD (Bertsekas & Tsitsiklis, 1996;
Tsitsiklis & Van Roy, 1997). Thus, it appears that there is a price to pay for off-policy convergence.
Future work should address the implications of the different norms in these bounds.

Nevertheless, we believe that the results in this paper motivate ETD (or its least-squares counterpart;
Yu, 2015) as the method of choice for off-policy policy-evaluation in MDPs.